Towards a Set of General Purpose Morphosyntactic Tools for Polish
Abstract
Morphological processing of Polish is seriously hampered by the poor availability of general-purpose tools. This article presents an attempt to create such a set of tools following the de facto standard of the IPI PAN Corpus (IPIC). Currently, the package contains software for the following tasks: text tokenisation, morphological analysis with heuristics for unknown words, division into sentences, and morphosyntactic disambiguation. The described tools will be made available under the GNU General Public License.
Similar resources
Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish
This paper introduces a new set of tools and resources for Polish which cover all the steps required to transform a raw unrestricted text into a reasonable input for a parser. This includes (1) a large-coverage morphological lexicon, developed thanks to the IPI PAN corpus as well as a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and na...
A Tiered CRF Tagger for Polish
In this paper we present a new approach to morphosyntactic tagging of Polish by bringing together Conditional Random Fields and tiered tagging. Our proposal also allows one to take advantage of a rich set of morphological features, which resort to an external morphological analyser. The proposed algorithm is implemented as a tagger for Polish. Evaluation of the tagger shows significant improvement ...
Optimizing Rule-Based Morphosyntactic Analysis of Richly Inflected Languages - a Polish Example
We consider finite-state optimization of morphosyntactic analysis of richly and ambiguously annotated corpora. We propose a general algorithm which, despite being surprisingly simple, proved to be effective in several applications for rulesets which do not match frequently.
MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora
The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...
EULIA: a graphical web interface for creating, browsing and editing linguistically annotated corpora
In this paper we present EULIA, a tool which has been designed for dealing with the linguistically annotated corpora generated by a set of different linguistic processing tools. The objective of EULIA is to provide a flexible and extensible environment for creating, consulting, visualizing, and modifying documents generated by existing linguistic tools. The documents used as input and output of the...
Publication date: 2008